Before starting, the user should choose a working directory, preferably a directory devoted exclusively for this tutorial. After starting an R session, change working directory, load the requisite packages and set standard options:
# If necessary, change the path below to the directory where the data files are stored. # "." means current directory. On Windows use a forward slash / instead of the usual \. workingDir <- "." setwd(workingDir) # Load packages librarian::shelf(magrittr, tidyverse, data.table, glue, WGCNA, BiocParallel, cluster) library(WGCNA) # The following setting is important, do not omit. options(stringsAsFactors = FALSE) knitr::opts_chunk$set(fig.align = "center")
In this section we illustrate simulation of expression data with a simple module structure.
We choose the basic parameters of the simulated data set: we will simulate 3000 genes in 50 samples. The genes will fall into five proper modules (labeled turquoise, blue, brown, green, and yellow) and a relatively large number of genes will be simulated outside of the proper modules ("grey" genes).
# Here are input parameters of the simulation model # number of samples or microarrays in the training data no.obs <- 50 # now we specify the true measures of eigengene significance # recall that ESturquoise=cor(y,MEturquoise) ESturquoise <- 0 ESbrown <- -.6 ESgreen <- .6 ESyellow <- 0 # Note that we don’t specify the eigengene significance of the blue module # since it is highly correlated with the turquoise module. ESvector <- c(ESturquoise, ESbrown, ESgreen, ESyellow) # number of genes nGenes1 <- 3000 # proportion of genes in the turquoise, blue, brown, green, and yellow module #respectively. simulateProportions1 <- c(0.2, 0.15, 0.08, 0.06, 0.04) # Note that the proportions don’t add up to 1. The remaining genes will be colored grey, # ie the grey genes are non-module genes. # set the seed of the random number generator. As a homework exercise change this seed. set.seed(1) # Step 1: simulate a module eigengene network. # Training Data Set I MEgreen <- rnorm(no.obs) scaledy <- MEgreen * ESgreen + sqrt(1 - ESgreen^2) * rnorm(no.obs) y <- ifelse(scaledy > median(scaledy), 2, 1) MEturquoise <- ESturquoise * scaledy + sqrt(1 - ESturquoise^2) * rnorm(no.obs) # we simulate a strong dependence between MEblue and MEturquoise MEblue <- .6 * MEturquoise + sqrt(1 - .6^2) * rnorm(no.obs) MEbrown <- ESbrown * scaledy + sqrt(1 - ESbrown^2) * rnorm(no.obs) MEyellow <- ESyellow * scaledy + sqrt(1 - ESyellow^2) * rnorm(no.obs) ModuleEigengeneNetwork1 <- data.frame(y, MEturquoise, MEblue, MEbrown, MEgreen, MEyellow)
The variable ModuleEigengeneNetwork1
contains the "seed" eigengenes and a simulated clinical trait y
. The eigengene network can be simply inspected by:
signif(cor(ModuleEigengeneNetwork1, use="p"),2)
The package contains a convenient function to simulate five modules which we call below:
dat1 <- simulateDatExpr5Modules( MEturquoise = ModuleEigengeneNetwork1$MEturquoise, MEblue = ModuleEigengeneNetwork1$MEblue, MEbrown = ModuleEigengeneNetwork1$MEbrown, MEyellow = ModuleEigengeneNetwork1$MEyellow, MEgreen = ModuleEigengeneNetwork1$MEgreen, nGenes = nGenes1, simulateProportions = simulateProportions1 )
The simulated data (dat1
) is a list with the following components:
names(dat1)
We attach the data "into the main search path" so we can use the component names directly, without referring to dat1
:
datExpr <- dat1$datExpr truemodule <- dat1$truemodule datME <- dat1$datME attach(ModuleEigengeneNetwork1)
To see what is in the data, we can simply type
table(truemodule) dim(datExpr)
The output indicated we simulated 3000 genes in 50 samples, and a large number of the genes are "grey", i.e., genes that are not part of any proper module. the next piece of code assigns gene and sample names to columns and rows of the expression data:
datExpr <- data.frame(datExpr) ArrayName <- paste("Sample", 1:dim(datExpr)[[1]], sep = "") # The following code is useful for outputting the simulated data GeneName <- paste("Gene", 1:dim(datExpr)[[2]], sep = "") dimnames(datExpr)[[1]] <- ArrayName dimnames(datExpr)[[2]] <- GeneName
In this section we illustrate loading of expression data and clinical trait. The files holding the information are provided on the main tutorial page.
datGeneSummary <- read.csv("../data/GeneSummaryTutorial.csv") datTraits <- read.csv("../data/TraitsTutorial.csv") datMicroarrays <- read.csv("../data/MicroarrayDataTutorial.csv")
A quick look at the content of datMicroarrays
:
datMicroarrays[1:5,1:5]
We now reformat the data and set appropriate gene and sample names:
# This vector contains the microarray sample names ArrayName <- names(data.frame(datMicroarrays[, -1])) # This vector contains the gene names GeneName <- datMicroarrays$GeneName # We transpose the data so that the rows correspond to samples and the columns correspond to genes # Since the first column contains the gene names, we exclude it. datExpr <- data.frame(t(datMicroarrays[, -1])) names(datExpr) <- datMicroarrays[, 1] dimnames(datExpr)[[1]] <- names(data.frame(datMicroarrays[, -1])) # Also, since we simulated the data, we know the true module color: truemodule <- datGeneSummary$truemodule
The first few entries in datExpr
are now
datExpr[1:5,1:5]
The input gene expression data should have the above format where column names correspond to gene (or probe) names, row names correspond to array names. We now isolate the microarray trait y
from the read data
# First, make sure that the array names in the file datTraits line up with those in the microarray data table(dimnames(datExpr)[[1]] == datTraits$ArrayName) # Next, define the microarray sample trait y <- datTraits$Trait.y
Because the loaded data are identical to the simulated ones, we do not need to save the results here. Subsequent sections of the tutorial use the results of the data simulation section (Section 1).
In this section we illustrate basic data cleaning and pre-processing steps for expression data.
We start by determining the mean expression per array and the number of missing values per array:
meanExpressionByArray <- apply(datExpr, 1, mean, na.rm = T) NumberMissingByArray <- apply(is.na(data.frame(datExpr)), 1, sum)
A simple way to examine the mean expression per array is to use ```r"} tibble(sample = 1:length(meanExpressionByArray), meanexpression = meanExpressionByArray) %>% ggplot(aes(x = as.factor(sample), y = meanexpression)) + geom_col() + labs(x = "Sample", y = "Mean expression", title = "Mean expression across samples") + theme(panel.background = element_rect(fill = NA, color = 'black'), axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1))
whose output is shown in Figure 3.1\ref{fig3.1}. No arrays in the plot seem to have an outlying mean expression value. The numbers of missing entries in each array are ```r NumberMissingByArray
We note that arrays with excessive numbers of missing data should be removed, for example as
# Keep only arrays containing less than 500 missing entries KeepArray <- NumberMissingByArray < 500 table(KeepArray) datExpr <- datExpr[KeepArray, ] y <- y[KeepArray] ArrayName[KeepArray]
Here we count the number of missing samples in each probe profile, and remove probes with extensive numbers of missing samples. In addition, we remove probes that do not vary at all.
NumberMissingByGene <- apply(is.na(data.frame(datExpr)), 2, sum) # One could do a barplot(NumberMissingByGene), but the barplot is empty in this case. # It may be better to look at the numbers of missing samples using the summary method: summary(NumberMissingByGene) # Calculate the variances of the probes and the number of present entries variancedatExpr <- as.vector(apply(as.matrix(datExpr), 2, var, na.rm = T)) no.presentdatExpr <- as.vector(apply(!is.na(as.matrix(datExpr)), 2, sum)) # Another way of summarizing the number of pressent entries table(no.presentdatExpr) # Keep only genes whose variance is non-zero and have at least 4 present entries KeepGenes <- variancedatExpr > 0 & no.presentdatExpr >= 4 table(KeepGenes) datExpr <- datExpr[, KeepGenes] GeneName <- GeneName[KeepGenes]
In this case, since the data is simulated without missing data or zero-variance probes, all probes are retained.
We use hierarchical clustering with the Euclidean distance to determine whether there are array (sample) outliers: ```r"} sizeGrWindow(9, 5) plotClusterTreeSamples(datExpr = datExpr, y = y)
The output is shown in Figure 3.2\ref{fig3.2}. There are no no obvious array outliers. If there are suscpicious samples, they should be removed from the analysis. In the figure, the microarray samples are colored by the outcome `y` (1=black, 2=red). Since the colors dont line up with the branches, we find no evidence that samples with `y` = 1 are "globally distinct" from those with `y` = 2. When clustering microarray samples, we recommend to use the Euclidean distance. In contrast, we recommend to use the topological overlap based dissimilarity for clustering genes. Below, we investigate different methods for clustering genes (gene expression profiles). # Standard gene screening ## Standard gene screening based on marginal correlation In this section we use marginal Pearson to relate genes to the trait `y`. ```r GS1 <- as.numeric(cor(y, datExpr, use = "p")) # Network terminology: GS1 will be referred to as signed gene significance measure p.Standard <- corPvalueFisher(GS1, nSamples = length(y)) # since the q-value function has problems with missing data, we use the following trick p.Standard2 <- p.Standard p.Standard2[is.na(p.Standard)] <- 1 q.Standard <- qvalue(p.Standard2)$qvalues # Form a data frame to hold the results StandardGeneScreeningResults <- data.frame(GeneName, PearsonCorrelation = GS1, p.Standard, q.Standard)
We take a quick look at the content of StandardGeneScreeningResults
:
head(StandardGeneScreeningResults)
We note that gene screening based on the Pearson correlation GS1
is equivalent to screening based on the Student T test statistic for a fixed sample size. The reason is that the Student T-test statistic is $t = \sqrt{m − 2}∗GS1/\sqrt{1 − GS1^2}$, where m
is the number of microarrays. Note that the t
statistic is a monotonic function of the correlation GS1 since m
is fixed for each gene.
We now determine how many noise genes are in the top say 100 genes returned by standard screening. Recall that the eigengene significances were
tibble(ESturquoise, ESbrown, ESgreen, ESyellow)
This implies that grey non-module genes, turquoise genes, the related blue genes, and yellow genes are noise. The following vector indicates which genes are noise (as simulated):
NoiseGeneIndicator <- is.element(truemodule, c("turquoise", "blue", "yellow", "grey")) + .0 SignalGeneIndicator <- 1 - NoiseGeneIndicator
Note the proportion of noise among the top 20 most significant genes:
mean(NoiseGeneIndicator[rank(p.Standard)<=20]) mean(NoiseGeneIndicator[rank(p.Standard)<=200]) mean(NoiseGeneIndicator[rank(p.Standard)<=100])
This implies that 48% of the top 100 most significant genes are noise. That is quite bad.
How many noise genes have a q-value less than or equal to 0.20? The following code will give the answer:
table(q.Standard < .20) mean(NoiseGeneIndicator[q.Standard <= 0.20])
Only 3 genes have a q-value smaller than .2. Two of the three are noise genes.
Discussion of the performance of standard screening. The performance of standard screening is unsatisfactory. For example, among the top 100 most significant genes (lowest p-value), 55% represent noise. One major reason for this poor performance is that a standard analysis ignores the correlations between gene expression profiles. In other words, it ignores the module structure inherent in the data. We find it biologically and statistically more meaningful to carry out a module based analysis, which first identifies modules and next uses module membership to screen for meaningful genes [1, 2]. WGCNA is a systems biologic gene screening method that makes use of module membership information (or equivalently intramodular connectivity). In the following sections, we will review clustering procedures, module detection methods, module eigengenes, module membership, intramodular connectivity and finally network based gene screening methods.
In the following, we describe how to carry out a detailed WGCNA analysis. This analysis reviews basic clustering procedures and provides an in-depth look at important concepts used by WGCNA.
In this section we provide a step-by-step overview of gene network construction and module detection.
We illustrate network construction and evaluation of scale free topology. Recall that the genes correspond to the columns of datExpr
.
```r"}
ADJ1 <- abs(cor(datExpr, use = "p"))^6
k <- as.vector(apply(ADJ1, 2, sum, na.rm = T))
k <- softConnectivity(datE = datExpr, power = 6)
sizeGrWindow(10, 5) par(mfrow = c(1, 2)) hist(k) scaleFreePlot(k, main = "Check scale free topology\n")
The resulting plot is shown in Figure 5.1\ref{fig5.1}. The approximate straight line relationship (high *R^2* value, right panel in Figure 5.1) shows approximate scale free topology. In most applications we find that scale free topology is at least approximately satisfied when a high power is chosen for defining the adjacency matrix. We should point out that is not necessary that a network satisfies scale free topology; scale free topology may not be satisfied if the data are comprised of globally very distinct groups of samples (e.g. different tissues types). Poor fit to scale free topology may also indicate the presence of array outliers. ## Computational restrictions on the number of genes For computational reasons, we restrict the network analysis to 3600 most connected genes ```r datExpr <- datExpr[, rank(-k, ties.method = "first") <= 3600]
This restriction is only necessary when constructing the network in a step-by-step, "manual" way. The user can also use the automatic one-step function blockwiseModules
that is able to handle large data sets. We refer the reader to Tutorials I, II (mouse data analysis) for examples of using the automatic function on large data sets.
Many clustering procedures require a dissimilarity matrix as input. We define a dissimilarity based on adjacency:
# Turn adjacency into a measure of dissimilarity dissADJ <- 1 - ADJ1
Adjacency can be used to define a separate measure of similarity, the Topological Overlap Matrix(TOM) [2, 1]:
#TOMdist doesn't work; however, `1-TOMsimilarity` is the same thing. dissTOM <- 1-TOMsimilarity(ADJ1, TOMType = "unsigned")
In the following subsections we show that Partitioning Around Medoids (PAM) is not a good choice for clustering genes because it forces all genes into a module. We then illustrate the use of hierarchical clustering and several branch cut methods.
We use PAM with 3 different numbers of clusters:
pam4 <- pam(as.dist(dissADJ), 4) pam5 <- pam(as.dist(dissADJ), 5) pam6 <- pam(as.dist(dissADJ), 6) # Cross-tabulte the detected and the true (simulated) module membership: table(pam4$clustering, truemodule) table(pam5$clustering, truemodule) table(pam6$clustering, truemodule)
The rows correspond to clusters found by PAM, while the columns correspond to simulated modules. Numbers in the table are gene counts in the intersection of the respective simulated and found modules. PAM performs terribly on this data set since the grey, non-module genes are "forced" into the observed modules (corresponding to the rows). In general, partitioning methods that dont allow for unclustered objects will have a problem on this simulated data set.
We again use PAM with the same three different numbers of clusters:
pamTOM4 <- pam(as.dist(dissTOM), 4) pamTOM5 <- pam(as.dist(dissTOM), 5) pamTOM6 <- pam(as.dist(dissTOM), 6) # Cross-tabulte the detected and the true (simulated) module membership: table(pamTOM4$clustering, truemodule) table(pamTOM5$clustering, truemodule) table(pamTOM6$clustering, truemodule)
As above, the rows correspond to clusters found by PAM, while the columns correspond to simulated modules. Numbers in the table are gene counts in the intersection of the respective simulated and found modules. The performance of pam is not good. This is why we will use hierarchical clustering.
Here we illustrate the use of average linkage hierachical clustering. ```r"} hierADJ <- hclust(as.dist(dissADJ), method = "average")
sizeGrWindow(10, 5) plotDendroAndColors(hierADJ, colors = data.frame(truemodule), dendroLabels = FALSE, hang = 0.03, main = "Gene hierarchical clustering dendrogram and simulated module colors" )
The result is shown in Figure 5.2\ref{fig5.2}. Note that the branches correspond to the true modules. The question now is how should we cut the branches of the dendrogram to determine module membership? ### Module definition via static (fixed) height cut-off By our definition, modules correspond to branches of the tree. The question is what height cut-off should be used? This depends on the biology. Large heigth values lead to big modules, small values lead to small but tight modules. In reality, the user should use different thresholds `h1` (red below) to assess how robust the findings are. The function `cutreeStaticColor` colors each gene by the branches that result from choosing a particular height cut-off. The label "grey" is reserved to color genes that are not part of any module. Here we only consider modules that contain at least minSize genes ```r"} colorStaticADJ <- as.character(cutreeStaticColor(hierADJ, cutHeight = .99, minSize = 20)) # Plot the dendrogram with module colors sizeGrWindow(10, 5) plotDendroAndColors(hierADJ, colors = data.frame(truemodule, colorStaticADJ), dendroLabels = FALSE, abHeight = 0.99, main = "Gene dendrogram and module colors" )
The resulting plot is shown in Figure 5.3\ref{fig5.3}. The static height cut-off method works quite well at retrieving the true modules. More precisely, it works well at retrieving highly connected intramodular hub genes in the modules. However, it misses a lot of genes at the fringes of the modules.
We now use two Dynamic Tree Cut methods in which the height cut-off plays a minor role. The first method is called the "tree" method and only uses the dendrogram as input.
branch.number <- cutreeDynamic(hierADJ, method = "tree") # This function transforms the branch numbers into colors colorDynamicADJ <- labels2colors(branch.number)
The second method is called "hybrid" and is a hybrid between hclust and pam. As input it requires both a dendrogram and the dissimilarity that was used to create the dendrogram. ```r"} colorDynamicHybridADJ <- labels2colors(cutreeDynamic(hierADJ, distM = dissADJ, cutHeight = 0.998, deepSplit = 2, pamRespectsDendro = FALSE ))
sizeGrWindow(10, 5) plotDendroAndColors( dendro = hierADJ, colors = data.frame( truemodule, colorStaticADJ, colorDynamicADJ, colorDynamicHybridADJ ), dendroLabels = FALSE, marAll = c(0.2, 8, 2.7, 0.2), main = "Gene dendrogram and module colors" )
The plot is shown in Figure 5.4\ref{fig5.4}. The static method (3rd band) has high specificity but low sensitivity, i.e. its module membership assignment is very accurate but it misses a lot of genes (too many grey genes). In contrast, the dynamic hybrid method (first band) has high sensitivity but low specificity. Actually, we simulated the grey genes so that they would have a weak "background" correlation with the turquoise module genes. Therefore, it comes as no surprise that the hybrid method assigns turquoise to many grey module genes. The "tree" method (second color band) does not perform very well since it splits some of the branches into two sub-branches. As we demonstrate below, one should always look at the tree and the correlations between the module eigengenes to determine whether two modules should be merged. ### Module definition using the topological overlap based dissimilarity We now use the topological overlap based dissimilarity as input to the clustering methods we used above ```r"} # Calculate the dendrogram hierTOM <- hclust(as.dist(dissTOM), method = "average") # The reader should vary the height cut-off parameter h1 # (related to the y-axis of dendrogram) in the following colorStaticTOM <- as.character(cutreeStaticColor(hierTOM, cutHeight = .99, minSize = 20)) colorDynamicTOM <- labels2colors(cutreeDynamic(hierTOM, method = "tree")) colorDynamicHybridTOM <- labels2colors(cutreeDynamic(hierTOM, distM = dissTOM, cutHeight = 0.998, deepSplit = 2, pamRespectsDendro = FALSE )) # Now we plot the results sizeGrWindow(10, 5) plotDendroAndColors(hierTOM, colors = data.frame( truemodule, colorStaticTOM, colorDynamicTOM, colorDynamicHybridTOM ), dendroLabels = FALSE, marAll = c(1, 8, 3, 1), main = "Gene dendrogram and module colors, TOM dissimilarity" )
The result is shown in Figure 5.5\ref{fig5.5}. The dynamic hybrid method leads to modules that are slightly too large. In practice this would not be a problem since one typically focuses on the tip of the branches (highly connected hub genes). Also it would have a very small effect on the definition of the module eigengenes.
The answer depends on threshold parameters used for branch cutting. For illustration, we carry out a brute force comparison. In the following, we create tables for relating the different module assignements to the true module colors.
tabStaticADJ <- table(colorStaticADJ, truemodule) tabStaticTOM <- table(colorStaticTOM, truemodule) tabDynamicADJ <- table(colorDynamicADJ, truemodule) tabDynamicTOM <- table(colorDynamicTOM, truemodule) tabDynamicHybridADJ <- table(colorDynamicHybridADJ, truemodule) tabDynamicHybridTOM <- table(colorDynamicHybridTOM, truemodule)
Next, we use the (unadjusted) Rand index to measure agreement. This computes the Rand index for each table:
randIndex(tabStaticADJ, adjust = F) randIndex(tabStaticTOM, adjust = F) randIndex(tabDynamicADJ, adjust = F) randIndex(tabDynamicTOM, adjust = F) randIndex(tabDynamicHybridADJ, adjust = F) randIndex(tabDynamicHybridTOM, adjust = F)
In this data set, dissTOM performs better than dissADJ for the first two branch cutting method. The results for the Dynamic Hybrid methods are very similar. In the following, we will proceed with the observed modules from DynamicHybridTOM. Here is the cross tabulation of colorDynamicHybridTOM with the true module assignment:
tabDynamicHybridTOM
We define a shorter name for the module assignemnt of choice:
colorh1 <- colorDynamicHybridTOM
datME <- moduleEigengenes(datExpr, colorh1)$eigengenes signif(cor(datME, use = "p"), 2)
dissimME <- (1 - t(cor(datME, method = "p"))) / 2 hclustdatME <- hclust(as.dist(dissimME), method = "average") # Plot the eigengene dendrogram par(mfrow = c(1, 1)) plot(hclustdatME, main = "Clustering tree based of the module eigengenes")
sizeGrWindow(8, 9) plotMEpairs(datME, y = y)
signif(cor(datME, ModuleEigengeneNetwork1[, -1]), 2)
sizeGrWindow(8, 9) par(mfrow = c(3, 1), mar = c(1, 2, 4, 1)) which.module <- "turquoise" plotMat(t(scale(datExpr[, colorh1 == which.module ])), nrgcols = 30, rlabels = T, clabels = T, rcols = which.module, title = which.module ) # for the second (blue) module we use which.module <- "blue" plotMat(t(scale(datExpr[, colorh1 == which.module ])), nrgcols = 30, rlabels = T, clabels = T, rcols = which.module, title = which.module ) which.module <- "brown" plotMat(t(scale(datExpr[, colorh1 == which.module ])), nrgcols = 30, rlabels = T, clabels = T, rcols = which.module, title = which.module )
sizeGrWindow(8, 7) which.module <- "green" ME <- datME[, paste("ME", which.module, sep = "")] par(mfrow = c(2, 1), mar = c(0.3, 5.5, 3, 2)) plotMat(t(scale(datExpr[, colorh1 == which.module ])), nrgcols = 30, rlabels = F, rcols = which.module, main = which.module, cex.main = 2 ) par(mar = c(5, 4.2, 0, 0.7)) barplot(ME, col = which.module, main = "", cex.main = 2, ylab = "eigengene expression", xlab = "array sample" )
signif(cor(y, datME, use = "p"), 2)
cor.test(y, datME$MEbrown)
p.values <- corPvalueStudent(cor(y, datME, use = "p"), nSamples = length(y))
GS1 <- as.numeric(cor(y, datExpr, use = "p")) GeneSignificance <- abs(GS1) # Next module significance is defined as average gene significance. ModuleSignificance <- tapply(GeneSignificance, colorh1, mean, na.rm = T)
sizeGrWindow(8, 7) par(mfrow = c(1, 1)) plotModuleSignificance(GeneSignificance, colorh1)
ADJ1 <- abs(cor(datExpr, use = "p"))^6 Alldegrees1 <- intramodularConnectivity(ADJ1, colorh1) head(Alldegrees1)
colorlevels <- unique(colorh1) sizeGrWindow(9, 6) par(mfrow = c(2, as.integer(0.5 + length(colorlevels) / 2))) par(mar = c(4, 5, 3, 1)) for (i in c(1:length(colorlevels))) { whichmodule <- colorlevels[[i]] restrict1 <- (colorh1 == whichmodule) verboseScatterplot(Alldegrees1$kWithin[restrict1], GeneSignificance[restrict1], col = colorh1[restrict1], main = whichmodule, xlab = "Connectivity", ylab = "Gene Significance", abline = TRUE ) }
datKME <- signedKME(datExpr, datME, outputColumnName = "MM.") # Display the first few rows of the data frame head(datKME)
FilterGenes <- abs(GS1) > .2 & abs(datKME$MM.brown) > .8 table(FilterGenes)
dimnames(data.frame(datExpr))[[2]][FilterGenes]
sizeGrWindow(8, 6) par(mfrow = c(2, 2)) # We choose 4 modules to plot: turquoise, blue, brown, green. # For simplicity we write the code out explicitly for each module. which.color <- "turquoise" restrictGenes <- colorh1 == which.color verboseScatterplot(Alldegrees1$kWithin[ restrictGenes], (datKME[restrictGenes, paste("MM.", which.color, sep = "")])^6, col = which.color, xlab = "Intramodular Connectivity", ylab = "(Module Membership)^6" ) which.color <- "blue" restrictGenes <- colorh1 == which.color verboseScatterplot(Alldegrees1$kWithin[ restrictGenes], (datKME[restrictGenes, paste("MM.", which.color, sep = "")])^6, col = which.color, xlab = "Intramodular Connectivity", ylab = "(Module Membership)^6" ) which.color <- "brown" restrictGenes <- colorh1 == which.color verboseScatterplot(Alldegrees1$kWithin[ restrictGenes], (datKME[restrictGenes, paste("MM.", which.color, sep = "")])^6, col = which.color, xlab = "Intramodular Connectivity", ylab = "(Module Membership)^6" ) which.color <- "green" restrictGenes <- colorh1 == which.color verboseScatterplot(Alldegrees1$kWithin[ restrictGenes], (datKME[restrictGenes, paste("MM.", which.color, sep = "")])^6, col = which.color, xlab = "Intramodular Connectivity", ylab = "(Module Membership)^6" )
NS1 <- networkScreening( y = y, datME = datME, datExpr = datExpr, oddPower = 3, blockSize = 1000, minimumSampleSize = 4, addMEy = TRUE, removeDiag = FALSE, weightESy = 0.5 )
# network screening analysis mean(NoiseGeneIndicator[rank(NS1$p.Weighted, ties.method = "first") <= 100]) # standard analysis based on the correlation p-values (or Student T test) mean(NoiseGeneIndicator[rank(NS1$p.Standard, ties.method = "first") <= 100])
topNumbers <- c(10, 20, 50, 100) for (i in c(1:length(topNumbers))) { print(paste("Proportion of noise genes in the top", topNumbers[i], "list")) WGCNApropNoise <- mean(NoiseGeneIndicator[rank(NS1$p.Weighted, ties.method = "first") <= topNumbers[i]]) StandardpropNoise <- mean(NoiseGeneIndicator[rank(NS1$p.Standard, ties.method = "first") <= topNumbers[i]]) print(paste( "WGCNA, proportion of noise=", WGCNApropNoise, ", Standard, prop. noise=", StandardpropNoise )) if (WGCNApropNoise < StandardpropNoise) print("WGCNA wins") if (WGCNApropNoise == StandardpropNoise) print("both methods tie") if (WGCNApropNoise > StandardpropNoise) print("standard screening wins") }
# Form a data frame containing standard and network screening results CorPrediction1 <- data.frame(GS1, NS1$cor.Weighted) cor.Weighted <- NS1$cor.Weighted # Plot the comparison sizeGrWindow(8, 6) verboseScatterplot(cor.Weighted, GS1, main = "Network-based weighted correlation versus Pearson correlation\n", col = truemodule, cex.main = 1.2 ) abline(0, 1)
set.seed(2) nSamples2 <- 2000 MEgreen <- rnorm(nSamples2) scaledy2 <- MEgreen * ESgreen + sqrt(1 - ESgreen^2) * rnorm(nSamples2) y2 <- ifelse(scaledy2 > median(scaledy2), 2, 1) MEturquoise <- ESturquoise * scaledy2 + sqrt(1 - ESturquoise^2) * rnorm(nSamples2) # we simulate a strong dependence between MEblue and MEturquoise MEblue <- .6 * MEturquoise + sqrt(1 - .6^2) * rnorm(nSamples2) MEbrown <- ESbrown * scaledy2 + sqrt(1 - ESbrown^2) * rnorm(nSamples2) MEyellow <- ESyellow * scaledy2 + sqrt(1 - ESyellow^2) * rnorm(nSamples2) # Put together a data frame of eigengenes ModuleEigengeneNetwork2 <- data.frame(y = y2, MEturquoise, MEblue, MEbrown, MEgreen, MEyellow) # Simulate the expression data dat2 <- simulateDatExpr5Modules( MEturquoise = ModuleEigengeneNetwork2$MEturquoise, MEblue = ModuleEigengeneNetwork2$MEblue, MEbrown = ModuleEigengeneNetwork2$MEbrown, MEyellow = ModuleEigengeneNetwork2$MEyellow, MEgreen = ModuleEigengeneNetwork2$MEgreen, simulateProportions = simulateProportions1, nGenes = nGenes1 ) # recall that this is the signed gene significance in the training data GS1 <- as.numeric(cor(y, datExpr, use = "p")) # the following is the signed gene significance in the test data GS2 <- as.numeric(cor(ModuleEigengeneNetwork2$y, dat2$datExpr, use = "p"))
sizeGrWindow(8, 6) par(mfrow = c(1, 1)) verboseScatterplot(GS1, GS2, main = "Trait-based gene significance in test set vs. training set\n", xlab = "Training set gene significance", ylab = "Test set gene significance", col = truemodule, cex.main = 1.4 )
EvaluationGeneScreening1 <- corPredictionSuccess( corPrediction = CorPrediction1, corTestSet = GS2, topNumber = seq(from = 20, to = 500, length = 30) ) par(mfrow = c(2, 2)) listcomp <- EvaluationGeneScreening1$meancorTestSetOverall matplot( x = listcomp$topNumber, y = listcomp[, -1], main = "Predicting positive and negative correlations", ylab = "mean cor, test data", xlab = "top number of genes in the training data" ) listcomp <- EvaluationGeneScreening1$meancorTestSetPositive matplot( x = listcomp$topNumber, y = listcomp[, -1], main = "Predicting positive correlations", ylab = "mean cor, test data", xlab = "top number of genes in the training data" ) listcomp <- EvaluationGeneScreening1$meancorTestSetNegative matplot( x = listcomp$topNumber, y = listcomp[, -1], main = "Predicting negative correlations", ylab = "mean cor, test data", xlab = "top number of genes in the training data" )
relativeCorPredictionSuccess( corPredictionNew = NS1$cor.Weighted, corPredictionStandard = GS1, corTestSet = GS2, topNumber = c(10, 20, 50, 100, 200, 500) )
# Create a data frame holding the results of gene screening GeneResultsNetworkScreening <- data.frame(GeneName = row.names(NS1), NS1) # Write the data frame into a file write.table(GeneResultsNetworkScreening, file = "GeneResultsNetworkScreening.csv", row.names = F, sep = "," ) # Output of eigengene information: datMEy <- data.frame(y, datME) eigengeneSignificance <- cor(datMEy, y) eigengeneSignificance[1, 1] <- (1 + max(eigengeneSignificance[-1, 1])) / 2 eigengeneSignificance.pvalue <- corPvalueStudent(eigengeneSignificance, nSamples = length(y)) namesME <- names(datMEy) # Form a summary data frame out1 <- data.frame(t(data.frame( eigengeneSignificance, eigengeneSignificance.pvalue, namesME, t(datMEy) ))) # Set appropriate row names dimnames(out1)[[1]][1] <- "EigengeneSignificance" dimnames(out1)[[1]][2] <- "EigengeneSignificancePvalue" dimnames(out1)[[1]][3] <- "ModuleEigengeneName" dimnames(out1)[[1]][-c(1:3)] <- dimnames(datExpr)[[1]] # Write the data frame into a file write.table(out1, file = "MEResultsNetworkScreening.csv", row.names = TRUE, col.names = TRUE, sep = ",") # Display the first few rows: head(out1)
# Write out gene information GeneName <- dimnames(datExpr)[[2]] GeneSummary <- data.frame(GeneName, truemodule, SignalGeneIndicator, NS1) write.table(GeneSummary, file = "GeneSummaryTutorial.csv", row.names = F, sep = ",") # here we output the module eigengenes and trait y without eigengene significances datTraits <- data.frame(ArrayName, datMEy) dimnames(datTraits)[[2]][2:length(namesME)] <- paste("Trait", dimnames(datTraits)[[2]][2:length(namesME)], sep = "." ) write.table(datTraits, file = "TraitsTutorial.csv", row.names = F, sep = ",") # here we output the simulated gene expression data MicroarrayData <- data.frame(GeneName, t(datExpr)) names(MicroarrayData)[-1] <- ArrayName write.table(MicroarrayData, file = "MicroarrayDataTutorial.csv", row.names = F, sep = ",")
# Perform network screening NS1GS <- networkScreeningGS(datExpr = datExpr, datME = datME, GS = GS1) # Organize its results for easier plotting GSprediction1 <- data.frame(GS1, NS1GS$GS.Weighted) GS.Weighted <- NS1GS$GS.Weighted # Plot a comparison between standard gene significance and network-weighted gene significance sizeGrWindow(8, 6) par(mfrow = c(1, 1)) verboseScatterplot(GS1, GS.Weighted, main = "Weighted gene significance vs. the standard GS\n", col = truemodule ) abline(0, 1)
EvaluationGeneScreeningGS <- corPredictionSuccess( corPrediction = GSprediction1, corTestSet = GS2, topNumber = seq(from = 20, to = 500, length = 30) ) sizeGrWindow(8, 6) par(mfrow = c(2, 2)) listcomp <- EvaluationGeneScreeningGS$meancorTestSetOverall matplot( x = listcomp$topNumber, y = listcomp[, -1], main = "Predicting positive and negative correlations", ylab = "mean cor, test data", xlab = "top number of genes in the training data" ) listcomp <- EvaluationGeneScreeningGS$meancorTestSetPositive matplot( x = listcomp$topNumber, y = listcomp[, -1], main = "Predicting positive correlations", ylab = "mean cor, test data", xlab = "top number of genes in the training data" ) listcomp <- EvaluationGeneScreeningGS$meancorTestSetNegative matplot( x = listcomp$topNumber, y = listcomp[, -1], main = "Predicting negative correlations", ylab = "mean cor, test data", xlab = "top number of genes in the training data" )
cmd1 <- cmdscale(as.dist(dissTOM), 2) sizeGrWindow(7, 6) par(mfrow = c(1, 1)) plot(cmd1, col = as.character(colorh1), main = "MDS plot", xlab = "Scaling Dimension 1", ylab = "Scaling Dimension 2" )
power <- 6 color1 <- colorDynamicTOM restGenes <- (color1 != "grey") diss1 <- 1 - TOMsimilarityFromExpr(datExpr[, restGenes], power = 6) hier1 <- hclust(as.dist(diss1), method = "average") diag(diss1) <- NA sizeGrWindow(7, 7) TOMplot(diss1^4, hier1, as.character(color1[restGenes]), main = "TOM heatmap plot, module genes" )
power <- 6 color1 <- colorDynamicTOM restGenes <- (color1 != "grey") diss1 <- 1 - adjacency(datExpr[, restGenes], power = 6) hier1 <- hclust(as.dist(diss1), method = "average") diag(diss1) <- NA sizeGrWindow(7, 7) TOMplot(diss1^4, hier1, as.character(color1[restGenes]), main = "Adjacency heatmap plot, module genes" )
sizeGrWindow(7, 7) topList <- rank(NS1$p.Weighted, ties.method = "first") <= 30 gene.names <- names(datExpr)[topList] # The following shows the correlations between the top genes plotNetworkHeatmap(datExpr, plotGenes = gene.names, networkType = "signed", useTOM = FALSE, power = 1, main = "signed correlations" )
sizeGrWindow(7, 7) # The following shows the correlations between the top genes plotNetworkHeatmap(datExpr, plotGenes = gene.names, networkType = "unsigned", useTOM = FALSE, power = 1, main = "signed correlations" )
sizeGrWindow(7, 7) # The following shows the TOM heatmap in a signed network plotNetworkHeatmap(datExpr, plotGenes = gene.names, networkType = "signed", useTOM = TRUE, power = 12, main = "C. TOM in a signed network" ) # The following shows the TOM heatmap in a unsigned network plotNetworkHeatmap(datExpr, plotGenes = gene.names, networkType = "unsigned", useTOM = TRUE, power = 6, main = "D. TOM in an unsigned network" )
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.